[https://nvbugs/6025177][fix] Fix KV cache issue (cherry-pick to release/1.3.0rc5.post2)#12819
Conversation
…ase/1.3.0rc5.post2) Cherry-pick of NVIDIA#12673 onto release/1.3.0rc5.post2. Fixes KV cache corruption caused by storing blocks with over-counted unique token extent during chunked prefill. Introduces getUsableUniqueTokenCountForReuse() and getMaterializedUniqueTokenCountForReuse() helpers to correctly cap the number of tokens stored for reuse. Also moves simulatePrefillCompletion to a KvCacheManagerTestUtil test utility class to avoid polluting the production interface. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
📝 Walkthrough

This PR refactors KV cache reuse token counting logic by introducing helper functions that compute accurate usable unique token counts based on context state, replacing hardcoded assumptions. It adds test utilities to simulate prefill completion for unit tests, exposes these utilities via nanobind Python bindings, and updates multiple unit tests to use the new simulation utility.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
5184-5189: ⚠️ Potential issue | 🟠 Major
Apply the simulated prefill completion to the stored requests, not a copy.

Line 5184 iterates `llmRequests` by value, so Line 5188 only updates a temporary. The vector returned from `fillKvCacheManager(...)` still contains requests with stale prefill state, and the later `removeSequence(...)` calls in `FillKvCacheAndCompleteRequestsTest` will keep exercising the wrong state.

🔧 Suggested fix

```diff
-    for (auto request : llmRequests)
+    for (auto& request : llmRequests)
     {
         kvCacheManager.addSequence(request.mRequestId, request.getPromptLen(), maxBeamWidth, request);
         request.mState = LlmRequestState::kGENERATION_IN_PROGRESS;
         tensorrt_llm::testing::KvCacheManagerTestUtil::simulatePrefillCompletion(request);
         kvCacheManager.storeContextBlocks(request);
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp` around lines 5184 - 5189, The loop over llmRequests is iterating by value so simulatePrefillCompletion and the state change only affect a copy; change the loop to iterate by reference (e.g., for (auto& request : llmRequests)) so kvCacheManager.addSequence, setting request.mState = LlmRequestState::kGENERATION_IN_PROGRESS, tensorrt_llm::testing::KvCacheManagerTestUtil::simulatePrefillCompletion(request), and kvCacheManager.storeContextBlocks(request) modify the actual entries returned by fillKvCacheManager(...) and later removeSequence(...) will see the updated prefill state in the FillKvCacheAndCompleteRequestsTest.
🧹 Nitpick comments (1)
cpp/tensorrt_llm/testing/kvCacheManagerTestUtil.h (1)
29-40: Narrow this helper's contract.

It only calls `setContextCurrentPosition(getPromptLen())`; `LlmRequestState` and the rest of the post-prefill bookkeeping stay unchanged. Either drive the full post-prefill transition here or rename/doc it as a materialization-only shim so tests do not assume stronger behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/testing/kvCacheManagerTestUtil.h` around lines 29 - 40, The helper simulatePrefillCompletion only sets llmRequest.setContextCurrentPosition(llmRequest.getPromptLen()) but leaves LlmRequestState and other post-prefill bookkeeping unchanged; either make it perform the full post-prefill transition (update LlmRequestState and any other post-prefill flags/members) or narrow its contract by renaming and documenting it as a materialization-only shim. Specifically, either implement the full post-prefill steps on batch_manager::LlmRequest (matching what production prefill completion does) inside simulatePrefillCompletion, or rename the function (e.g., simulatePrefillMaterialization) and update its comment to state it only calls setContextCurrentPosition/getPromptLen and does NOT modify LlmRequestState or other bookkeeping, and update all tests to use the new name/expectations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp`:
- Around line 403-406: The call to
tensorrt_llm::testing::KvCacheManagerTestUtil::simulatePrefillCompletion(*)
inside the if (llmReq->getContextRemainingLength() == 0) branch is a no-op
because remaining==0 already implies llmReq->getContextCurrentPosition() ==
llmReq->getPromptLen(); to actually exercise the materialized-prefix path either
remove the simulatePrefillCompletion() call from this branch or change the test
setup/condition so simulatePrefillCompletion() runs when
contextRemainingLength() > 0 (or set llmReq so getContextCurrentPosition() <
getPromptLen() before calling simulatePrefillCompletion()), then call
kvCacheManager->storeContextBlocks(*llmReq) — update the condition or request
initialization accordingly to ensure simulatePrefillCompletion() meaningfully
mutates the request state.
In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp`:
- Line 34: Update the NVIDIA copyright header in the modified file
kvCacheManagerTest.cpp to reflect 2026 (the latest meaningful modification
year); locate the file header at the top of kvCacheManagerTest.cpp (the test
file that includes "tensorrt_llm/testing/kvCacheManagerTestUtil.h") and change
the year range/end year to 2026 so the header is current.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0855f8a5-dac7-4461-8c53-cb3568dacef1
📒 Files selected for processing (11)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/nanobind/CMakeLists.txt
- cpp/tensorrt_llm/nanobind/bindings.cpp
- cpp/tensorrt_llm/nanobind/testing/kvCacheManagerTestUtilBinding.cpp
- cpp/tensorrt_llm/nanobind/testing/kvCacheManagerTestUtilBinding.h
- cpp/tensorrt_llm/testing/kvCacheManagerTestUtil.h
- cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
- cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
- tests/unittest/_torch/executor/test_resource_manager.py
- tests/unittest/llmapi/test_llm_kv_cache_events.py
Force-pushed: f30dbf2 to 20131a8
/bot run --disable-fail-fast
PR_Github #42198 [ run ] triggered by Bot. Commit:
PR_Github #42198 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #42322 [ run ] triggered by Bot. Commit:
PR_Github #42322 [ run ] completed with state
Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
/bot run --disable-fail-fast
PR_Github #42365 [ run ] triggered by Bot. Commit:
/bot kill
…ntamination_release_1.3.0rc5.post2
PR_Github #42389 [ kill ] triggered by Bot. Commit:
PR_Github #42389 [ kill ] completed with state
/bot run --disable-fail-fast
PR_Github #42797 [ run ] triggered by Bot. Commit:
PR_Github #42797 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #42829 [ run ] triggered by Bot. Commit:
PR_Github #42829 [ run ] completed with state
Summary
- Cherry-pick onto release/1.3.0rc5.post2

Test plan

- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
- cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp

🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests